Group 28: HCDR

Home Credit Default Risk (HCDR)

The course project is based on the Home Credit Default Risk (HCDR) Kaggle Competition. The goal of this project is to predict whether or not a client will repay a loan. In order to make sure that people who struggle to get loans due to insufficient or non-existent credit histories have a positive loan experience, Home Credit makes use of a variety of alternative data--including telco and transactional information--to predict their clients' repayment abilities.

Some of the challenges

  1. Dataset size
    • 688 MB compressed, with millions of rows of data
    • 2.71 GB uncompressed

Kaggle API setup

Kaggle is a Data Science competition platform that hosts many datasets. In the past it was troublesome to submit your results, as you had to go through the console in your browser and drag your files there. Now you can interact with Kaggle via the command line. E.g.,

! kaggle competitions files home-credit-default-risk

It is quite easy to set up; a submission takes less than 15 minutes.

  1. Install library

For more detailed information on setting the Kaggle API see here and here.
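For reference, a typical setup looks like the following (a sketch, assuming pip is available and an API token has been downloaded from your Kaggle account page; the download path is illustrative):

```shell
# Install the Kaggle CLI (assumes pip is available)
pip install kaggle

# Place your API token (downloaded from kaggle.com -> Account -> Create New API Token)
mkdir -p ~/.kaggle
cp ~/Downloads/kaggle.json ~/.kaggle/kaggle.json
chmod 600 ~/.kaggle/kaggle.json   # the CLI refuses tokens readable by other users

# Verify access by listing the competition files
kaggle competitions files home-credit-default-risk
```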

Dataset and how to download

Background: Home Credit Group

Many people struggle to get loans due to insufficient or non-existent credit histories. And, unfortunately, this population is often taken advantage of by untrustworthy lenders.

Home Credit Group

Home Credit strives to broaden financial inclusion for the unbanked population by providing a positive and safe borrowing experience. In order to make sure this underserved population has a positive loan experience, Home Credit makes use of a variety of alternative data--including telco and transactional information--to predict their clients' repayment abilities.

While Home Credit is currently using various statistical and machine learning methods to make these predictions, they're challenging Kagglers to help them unlock the full potential of their data. Doing so will ensure that clients capable of repayment are not rejected and that loans are given with a principal, maturity, and repayment calendar that will empower their clients to be successful.

Background on the dataset

Home Credit is a non-banking financial institution, founded in 1997 in the Czech Republic.

The company operates in 14 countries (including the United States, Russia, Kazakhstan, Belarus, China, and India) and focuses primarily on lending to people with little or no credit history, who would otherwise either be unable to obtain loans or become victims of untrustworthy lenders.

Home Credit Group has over 29 million customers, total assets of 21 billion euros, and over 160 million loans, with the majority in Asia and almost half of them in China (as of 19-05-2018).


Data files overview

There are 7 different sources of data:

Downloading the files via Kaggle API

Create a base directory:

DATA_DIR = "../../../Data/home-credit-default-risk"   #same level as course repo in the data directory

Please download the project data files and data dictionary and unzip them using either of the following approaches:

  1. Click on the Download button on the following Data Webpage and unzip the zip file to the BASE_DIR
  2. If you plan to use the Kaggle API, please use the following steps.

Importing Libraries

Data files overview

Data Dictionary

As part of the data download comes a Data Dictionary, named HomeCredit_columns_description.csv.

Application train

Datasets

Loading Data from csv files

Exploratory Data Analysis

EDA for Application Train

Summary of Application train

Missing data for application train

Observations

Distribution of NAME_CONTRACT_TYPE attribute with TARGET class

Column Types

Observations:

We can see that there are 16 categorical attributes in the application train file; the remaining attributes are numerical.

Observations

The output above shows the number of unique values for each categorical attribute in the application train data.

Anomalies in the DAYS_EMPLOYED attribute of Application train

Observation:
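In this dataset the DAYS_EMPLOYED anomaly is the sentinel value 365243 (roughly 1000 years of employment). A common remedy, sketched below, is to flag the sentinel and replace it with NaN so imputers can handle it later (the DAYS_EMPLOYED_ANOM flag name is our own):

```python
import numpy as np
import pandas as pd

def fix_days_employed(df):
    """Flag the 365243 sentinel in DAYS_EMPLOYED and replace it with NaN."""
    out = df.copy()
    out["DAYS_EMPLOYED_ANOM"] = out["DAYS_EMPLOYED"] == 365243
    out["DAYS_EMPLOYED"] = out["DAYS_EMPLOYED"].replace(365243, np.nan)
    return out
```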

Distribution of the target column

Observations

The graph above shows that the majority of people are able to repay the loan.

Correlation with the target column

Observations

Applicants Age

Observations

The histogram above shows that people between the ages of 35 and 40 apply for loans most often, and this could be a useful factor for prediction.
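The ages come from the DAYS_BIRTH column, which stores negative days relative to the application date; a small helper (a sketch, with a column name of our own) converts it to years for plotting:

```python
import pandas as pd

def add_age_years(df):
    """DAYS_BIRTH is recorded as negative days relative to the application date;
    convert it to age in years for plotting histograms."""
    out = df.copy()
    out["AGE_YEARS"] = out["DAYS_BIRTH"] / -365
    return out
```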

Applicants occupations

Observations

The countplot above shows that most of the applicants are Laborers, and this could be an interesting area to research in a future phase.

EDA for external sources

Performing EDA on external sources attributes to check whether these attributes would be helpful or not.

Observation

Based on the correlation matrix above, we can see that the External Source 1, 2, 3 and DAYS_BIRTH attributes are not highly correlated with the target attribute, as the highest value is 0.6. We therefore do not drop these features in preprocessing, since they may still be useful in predicting the target.

EXT_SOURCE_1, EXT_SOURCE_2, and EXT_SOURCE_3 are distributed almost identically across the two target classes.

Categorical Variables

The majority of borrowers have less than 15 years of employment experience.

EDA for Bureau

Summary of Bureau

Missing data for bureau

Bureau Missing Values

Correlation

Column Types

EDA for Bureau Balance

Summary of Bureau Balance

Missing data for bureau

Correlation

Column Types

EDA for Credit Card Balance

Summary of Credit Card Balance

Missing data for credit_card_balance

Plotting Count Plots for Categorical Attributes

Correlation

Column Types

EDA for Installments Payments

Summary of Installments Payments

Missing data for Installments payments

Correlation

Column Types

EDA for Previous Application

Summary of Previous Application

Missing data for Previous Application

Correlation

Column Types

EDA for POS_CASH_balance

Summary of POS_CASH_balance

Missing data for POS_CASH_balance

Most contracts in the POS_CASH_BALANCE data are in Active status

Correlation

Column Types

Dataset questions

Unique record for each SK_ID_CURR

previous applications for the submission file

The persons in the Kaggle submission file have had previous applications in previous_application.csv: 47,800 out of 48,744 people have had previous applications.

Histogram of Number of previous applications for an ID

Can we differentiate applicants by low, medium, and high numbers of previous applications?
* Low = fewer than 5 previous applications (22%)
* Medium = 10 to 39 previous applications (58%)
* High = 40 or more previous applications (20%)

Joining secondary tables with the primary table

In the case of the HCDR competition (and many other machine learning problems that involve multiple tables in 3NF or not) we need to join these datasets (denormalize) when using a machine learning pipeline. Joining the secondary tables with the primary table will lead to lots of new features about each loan application; these features will tend to be aggregate type features or meta data about the loan or its application. How can we do this when using Machine Learning Pipelines?
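One common pattern, sketched below with illustrative column names: aggregate the numeric columns of each secondary table per SK_ID_CURR, flatten the resulting column names, and left-join the aggregates onto the application table.

```python
import pandas as pd

def aggregate_and_join(app_df, secondary_df, key="SK_ID_CURR", prefix="PREV"):
    """Aggregate numeric columns of a secondary table per key and
    left-join the aggregates onto the primary (application) table."""
    num = secondary_df.drop(columns=[key]).select_dtypes("number")
    agg = num.groupby(secondary_df[key]).agg(["mean", "max", "min", "count"])
    # Flatten the (column, statistic) MultiIndex into single feature names
    agg.columns = [f"{prefix}_{col}_{stat}".upper() for col, stat in agg.columns]
    return app_df.merge(agg, how="left", left_on=key, right_index=True)
```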

Joining previous_application with application_x

We refer to the application_train data (and likewise the application_test data) as the primary table and the other files as the secondary tables (e.g., the previous_application dataset). The secondary tables can be joined to the primary table using the key SK_ID_CURR; previous_application additionally carries its own key, SK_ID_PREV, which links it to the tertiary tables.

Let's assume we wish to generate a feature based on previous application attempts. In this case, possible features here could be:

To build such features, we need to join the application_train data (and likewise the application_test data) with the previous_application dataset (and the other available datasets).

When joining this data in the context of pipelines, different strategies come to mind with various tradeoffs:

  1. Preprocess each of the non-application datasets, generating many new (derived) features, and then join (merge) the results with the application_train data (the labeled dataset) and with the application_test data (the unlabeled submission dataset) before processing the data (in a train/validation/test partition) via your machine learning pipeline. [This approach is recommended for this HCDR competition. WHY?]

I want you to think about this section and build on this.

Roadmap for secondary table processing

  1. Transform all the secondary tables into features that can be joined into the main application table (labeled and unlabeled)
    • 'bureau', 'bureau_balance', 'credit_card_balance', 'installments_payments',
    • 'previous_application', 'POS_CASH_balance'

Multiple condition expressions in Pandas

So far, our boolean selections have involved a single condition. You can, of course, combine as many conditions as you like. To do so, you combine your boolean expressions using the three logical operators: and, or, and not.

Use &, |, and ~. Although Python uses the keywords and, or, and not, these will not work when testing multiple conditions with pandas. The details of why are explained here.

You must use the following operators with pandas:
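For example (with illustrative column values; note that each condition must be parenthesized):

```python
import pandas as pd

df = pd.DataFrame({"AMT_CREDIT": [100, 500, 900],
                   "CODE_GENDER": ["M", "F", "F"]})

# & = and, | = or, ~ = not
both = df[(df["AMT_CREDIT"] > 200) & (df["CODE_GENDER"] == "F")]
either = df[(df["AMT_CREDIT"] > 800) | (df["CODE_GENDER"] == "M")]
negated = df[~(df["CODE_GENDER"] == "M")]
```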

Missing values in prevApps

feature engineering for prevApp table

Feature Engineering for All Tables

Choosing Highly correlated features from all input datasets

feature transformer for prevApp table

Feature Aggregator

Prepare Datasets

Tertiary Datasets

The tertiary datasets (tables) are bureau_balance, POS_CASH_balance, installments_payments, and credit_card_balance.

Third Tier Custom - Domain Knowledge based features

Any domain-based features that will aid in a better model have been included here. For example, in the credit_card_balance table, the payment difference can be valuable for predicting risk.

Third Tier Datasets Numerical feature Aggregation

Feature Aggregation for tertiary datasets

Merge Tertiary level data with secondary level data

Merging the aggregated features for POS_CASH_balance, installments_payments, and credit_card_balance with previous_application.

Merging the aggregated features of the bureau_balance dataset with bureau, as per the data model.

Secondary Datasets

Second Tier datasets Numerical feature aggregation

Second Tier Custom - Domain Knowledge based features

Primary Datasets

Merge Aggregated Dataset With Tier 1 Tables - Train and Test

Prior to merging with the primary data, we drop columns with more than 50% missing values, because they are not reliable parameters.
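A sketch of that 50%-missing-values filter (the function name and configurable threshold are our own):

```python
import pandas as pd

def drop_sparse_columns(df, threshold=0.5):
    """Drop columns whose fraction of missing values exceeds the threshold."""
    missing_frac = df.isna().mean()
    keep = missing_frac[missing_frac <= threshold].index
    return df[keep]
```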

Merging Secondary level data with Application Train&Test Data

Custom - Domain Knowledge based Features

Handle remaining missing values and null values

Fill NA values with 0 by executing fillna(0).

Total Numeric features in Application Train data

Total Categorical features in the application train data.

Deductions from the list of dtypes of the appsTrainDF

Pipeline

HCDR Preprocessing

Column Selector

Numerical Attributes

Numerical Pipeline definition

OHE when previously unseen unique values appear in the test/validation set

Train, validation, and test sets (and the leakage problem we have mentioned previously):

Let's look at a small use case to see how to deal with this:

This last problem can be solved by using the option handle_unknown='ignore' of the OneHotEncoder, which, as the name suggests, ignores previously unseen values when transforming the test set.

Here is an example of that in action:

# Identify the categorical features we wish to consider.
cat_attribs = ['CODE_GENDER', 'FLAG_OWN_REALTY','FLAG_OWN_CAR','NAME_CONTRACT_TYPE', 
               'NAME_EDUCATION_TYPE','OCCUPATION_TYPE','NAME_INCOME_TYPE']

# Note handle_unknown="ignore" in the OHE, which ignores values from the
# validation/test sets that do NOT occur in the training set
cat_pipeline = Pipeline([
        ('selector', DataFrameSelector(cat_attribs)),
        ('imputer', SimpleImputer(strategy='most_frequent')),
        ('ohe', OneHotEncoder(sparse=False, handle_unknown="ignore"))
    ])

Categorical Pipeline definition

Create Data Preparation Pipeline

Using FeatureUnion, combine the numerical and categorical pipelines to prepare the full data pipeline.
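A sketch of the combined preparation pipeline in that style; the attribute lists are illustrative, and DataFrameSelector is a minimal stand-in for the selector class used above:

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

class DataFrameSelector(BaseEstimator, TransformerMixin):
    """Select a subset of DataFrame columns (minimal stand-in for the selector above)."""
    def __init__(self, attribs):
        self.attribs = attribs
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return X[self.attribs]

num_attribs = ["AMT_INCOME_TOTAL", "AMT_CREDIT"]   # illustrative
cat_attribs = ["NAME_CONTRACT_TYPE"]               # illustrative

num_pipeline = Pipeline([
    ("selector", DataFrameSelector(num_attribs)),
    ("imputer", SimpleImputer(strategy="mean")),
    ("scaler", StandardScaler()),
])
cat_pipeline = Pipeline([
    ("selector", DataFrameSelector(cat_attribs)),
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("ohe", OneHotEncoder(handle_unknown="ignore")),
])

# FeatureUnion concatenates the numerical and categorical feature matrices
full_pipeline = FeatureUnion([("num", num_pipeline), ("cat", cat_pipeline)])
```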

Baseline model with Imbalanced Dataset

Create Train and Test Datasets

Define pipeline

Perform cross-fold validation and Train the model

Split the training data into 3 folds to perform cross-fold validation

Calculate Metrics

Confusion matrix

AUC (Area under ROC curve)

Precision Recall Curve

Baseline Model - With sampled data

To get a baseline, we will use some of the features after they have been preprocessed through the pipeline. The baseline model is a logistic regression model. Since the 'No default' and 'Default' target records are not balanced in the training set, we resample the minority class ('Default', with target value 1) to balance the input dataset.

Resample Minority class

Resampling should be performed only on the train dataset, to avoid overfitting and data leakage.

After resampling, the default and non-default classes are balanced.
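One way to do this, sketched with sklearn.utils.resample (upsampling the minority class with replacement; the helper name is our own):

```python
import pandas as pd
from sklearn.utils import resample

def upsample_minority(train_df, target="TARGET", minority=1, seed=42):
    """Upsample the minority class so both classes have equal counts.
    Apply to the training split only, to avoid leakage."""
    maj = train_df[train_df[target] != minority]
    mino = train_df[train_df[target] == minority]
    mino_up = resample(mino, replace=True, n_samples=len(maj), random_state=seed)
    # Shuffle after concatenation so the classes are interleaved
    return pd.concat([maj, mino_up]).sample(frac=1, random_state=seed).reset_index(drop=True)
```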

Create a Pipeline with Baseline Model

Create crossfold validation splits

Split the training data into 5 folds to perform cross-fold validation

Baseline Prediction

Baseline metrics

Accuracy, AUC score, F1 score, and log loss are used to measure the baseline model.

Confusion matrix

AUC (Area under ROC curve)

Precision Recall Curve

Various classification algorithms were compared to find the best model. The following metrics were used:

Classifiers

Hyper-parameters for all models specified above

Logistic Regression Model

RandomForest

SVM(Support Vector Machines)

It took a prohibitively long time to execute.

XGBoost

Model Validation

Feature Importance based on all Models

AUC(Area Under ROC Curve)

Precision Recall Curve

Confusion Matrix

Final Results

Kaggle submission

Best Pipeline for submission

Voting Classifier to predict best results based on best Classifier Probability

Submission File Prep

For each SK_ID_CURR in the test set, you must predict a probability for the TARGET variable. The file should contain a header and have the following format:

SK_ID_CURR,TARGET
100001,0.1
100005,0.9
100013,0.2
etc.

XGBoosT best pipeline for submission

Deep Learning

Deep Learning Model Pipeline & Workflow

Deep learning is a subfield of machine learning: learning from past data using artificial neural networks with multiple hidden layers (two or more). Deep neural networks uncrumple complex representations of the data step by step, layer by layer (hence the multiple hidden layers) into a clear representation of the data. An artificial neural network with one hidden layer between the input and output layers is called a multi-layer perceptron (MLP).

Deep Learning Pipeline Model workflow

Imports

Single layer Neural Network

Data Preparation: transform the data using the data pipeline and convert it into tensors for the neural network pipeline.

One layer: Linear layer with Sigmoid activation function

A sigmoid layer is used to produce the probability of prediction.

Train Neural Network

Evaluation of Neural Network model

Multi Layer NN model with custom Hinge & CXE Loss

Model Definition

The model contains one linear layer, one hidden layer with a ReLU activation, and a sigmoid function for probability prediction.
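A minimal PyTorch sketch of that architecture (the layer sizes are illustrative, not the tuned values):

```python
import torch
import torch.nn as nn

class MultiLayerNet(nn.Module):
    """One input linear layer, one hidden layer with ReLU, sigmoid output for P(default)."""
    def __init__(self, n_features, n_hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_features, n_hidden),
            nn.ReLU(),
            nn.Linear(n_hidden, 1),
            nn.Sigmoid(),   # squashes the output into a probability
        )

    def forward(self, x):
        return self.net(x)
```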

Train Model

Execute Model

Multi layer Neural Network Model results below:

Plot Convergence

Kaggle submission via the command line API

Submitted Via website

Kaggle Submission for Neural Networks Model.

Kaggle Submission Score for Multilayer Neural Network Model

Write-up

Home Credit Default Risk

Team Members(Group 28):

| Sno | Name                          | Email          |
|-----|-------------------------------|----------------|
| 1   | Vamsee Krishna Sai Naramsetty | vnarams@iu.edu |
| 2   | Harishanker Brahma Kande      | hkande@iu.edu  |
| 3   | Pranay Reddy Dasari           | pdasari@iu.edu |
| 4   | Jaswanth Kumar Ranga          | jranga@iu.edu  |



Overview

The course project is based on the Kaggle Competition on Home Credit Default Risk (HCDR). The project's purpose is to predict whether a client will repay a loan. Home Credit uses a variety of alternative data--including telco and transactional information--to estimate their clients' repayment ability, in order to ensure that people who struggle to secure loans owing to insufficient or non-existent credit histories have a positive loan experience.

Abstract

The purpose of this project is to create a machine learning model/deep learning model that can predict consumer behavior during loan repayment.

In this phase our goal is to build a multi-layer neural network model in PyTorch and use TensorBoard to visualize real-time training results. This phase focused on building high-performance neural networks, monitoring error generalization with the early-stopping technique, and evaluating model performance via loss functions such as CXE and hinge loss. We built two models: the first contains one linear layer with a ReLU function for probability prediction, and the second contains one linear layer and one hidden layer with a ReLU function, plus a sigmoid function for probability prediction. Using TensorBoard, we visualize the CXE loss on the training data for each epoch.

Our results in this phase: for the multi-layer neural network model, the AUC scores are 0.588 on the train data and 0.5172 on the test data. For the single-layer model, the AUC score is 0.7558 on the test data. For our Kaggle submission we received a public score of 0.512 and a private score of 0.510.

Project Description

Home Credit is an international non-bank financial institution, which primarily focuses on lending money to people regardless of their credit history. Home Credit Group aims to provide a positive borrowing experience to customers who do not rely on traditional sources for obtaining loans. Hence, Home Credit Group published a dataset on the Kaggle website with the objective of identifying and solving unfair loan rejection.

The purpose of this project is to create a machine learning model that can predict consumer behavior during loan repayment. Our task in this phase is to create a pipeline to build a baseline machine learning model using the logistic regression algorithm. The resulting model will be evaluated with various performance metrics in order to build a better model. Companies can rely on the output of this model to identify whether a loan is at risk of default. The new model would help companies avoid losses and make significant profits, and will ensure that clients capable of repayment are not rejected and that loans are given with a principal, maturity, and repayment calendar that will empower their clients to be successful.

The results of our machine learning pipelines will be measured using the following metrics:

The pipeline results will be compared and ranked using the appropriate measurements and the most efficient pipeline will be submitted to the HCDR Kaggle Competition.

Workflow

For this project, we are following the proposed workflow as mentioned below.

Data Description

Overview: the full dataset consists of 7 tables, with 1 primary table and 6 secondary tables.

Primary Tables

Secondary Tables

EDA(Exploratory Data Analysis)

Exploratory data analysis is important to this project because it helps us understand the data and brings us closer to certainty that future results will be valid, accurately interpreted, and applicable to the proposed solution.

In phase 1 of our project, EDA helped us look at the summary statistics for each table, focusing on missing data, outliers, and aggregate functions such as mean and median, along with visual representations of features for a better understanding of the data.

To identify missing data we examined both categorical and numerical features. Specific features were visualized based on their correlation values. The highly correlated features were used to plot densities and compare their distributions against the target variable. We used different plots such as countplot, heatmap, density plot, and catplot to visualize our analysis.

Key Observations:

Feature Engineering and transformers

Feature engineering is important because it is directly reflected in the quality of the machine learning model: in this phase new features are created, transformed, or dropped based on aggregator functions such as max, min, mean, sum, and count.

Domain-knowledge-based features, which help increase a model's accuracy, are an important aspect of any feature engineering process. The first step was to determine which of these were applicable to each dataset. Among them were credit card balance after payment based on the due amount, application amount average, credit average, and other new custom features, as well as percentage-based features: available credit as a percentage of income, annuity as a percentage of income, and annuity as a percentage of available credit.

The next stage was to find the numerical characteristics and aggregate them into mean, minimum, and maximum values. During the engineering phase, an effort was made to use label encoding for attributes with more than 5 unique values. However, to reduce the amount of code required for the same functionality, a design choice was made to apply OHE at the pipeline level for specified highly correlated variables on the final merged dataset.

Extensive feature engineering was carried out by experimenting with several modeling techniques on the primary, secondary, and tertiary tables before settling on an efficient strategy that used the least memory. For the Tier 3 tables (bureau_balance, credit_card_balance, installments_payments, and POS_CASH_balance), the first attempt entailed developing engineered and aggregated features. These were then combined with the Tier 2 tables, such as previous_application, along with their aggregated features. A flattened view comprising all of the aforementioned tables was then integrated with the core application_train dataset. As a result, there were many redundant features that took up a lot of memory.

The second attempt involved creating custom and aggregated features for the Tier 3 tables and merging them with the Tier 2 tables on the provided primary keys; the result was later "extended" to the Tier 1 tables via the additional aggregated columns. This approach created fewer duplicates, was better optimized, and occupied less memory by invoking the garbage collector after each merge.

A train dataframe was created by merging the Tier 3, Tier 2, and Tier 1 datasets. Extra precautions were taken to verify that no column had more than 50% of its data missing. The engineered features were included in the model with modest splits to help test the model, though the accuracy was low. However, for Random Forest and XGBoost, employing these combined features with appropriate splits during the training phase resulted in improved accuracy and a reduced risk of overfitting. Label encoding for unique categorical values in all categorical fields, not just a few, will be the focus of future research and trials.

Pipelines

Phase-1

The logistic regression model is used as the baseline model, since it is easy to implement yet efficient, and training it doesn't require much computing power. We also tuned the regularization penalty, tolerance, and C hyperparameters for the logistic regression model and compared the results with the baseline. We used 5-fold cross-validation with these hyperparameters to tune the model, applying the GridSearchCV function in sklearn.

Below is the workflow for the model pipeline.

Phase-2

In Phase 1, we used the logistic regression model as the baseline since it didn't take much computing power and was simple to execute. We also used customized logistic models with a balanced dataset to increase the model's predictiveness. In Phase 2, we looked at different classification models to see if we could improve our forecast. Our main focus was on boosting algorithms, which are believed to be extremely efficient and relatively fast. The modeling workflow for Phase 2 is depicted in the diagram below. We used XGBoost, Random Forest, and SVM in our research.

Below is the reason for choosing the mentioned models.

Boosting algorithms can overfit if the number of trees is very large. We made two submissions to Kaggle: one using the Voting Classifier, and the other with the best single classifier, i.e., XGBoost. A Voting Classifier is a machine learning model that trains an ensemble of various models and predicts the class with the highest combined probability. We chose soft voting instead of hard voting, since soft voting predicts based on the average of all models' predicted probabilities.
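A sketch of soft voting with sklearn's VotingClassifier, using stand-in base models and synthetic data for illustration (our submission combined different classifiers):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, random_state=0)

# voting="soft" averages the predicted class probabilities of the base models
voter = VotingClassifier(
    estimators=[("lr", LogisticRegression(max_iter=1000)),
                ("rf", RandomForestClassifier(n_estimators=50, random_state=0))],
    voting="soft",
)
voter.fit(X, y)
proba = voter.predict_proba(X)
```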

Phase-3

Below is the workflow for the multi layer neural network model pipeline.

Below is the pipeline for the Multi layer neural network model.

Hyperparameters Used

Below are the hyperparameters we used for training different models:

# Arrange grid search parameters for each classifier
params_grid = {
        'Logistic Regression': {
            'penalty': ('l1', 'l2'),
            'tol': (0.0001, 0.00001), 
            'C': (10, 1, 0.1, 0.01),
        }
    ,
        'Support Vector' : {
            'kernel': ('rbf','poly'),     
            'degree': (4, 5),
            'C': ( 0.0001, 0.001),   #Low C - allow for misclassification
            'gamma':(0.01,0.1,1)  #Low gamma - high variance and low bias
        }
    ,
        'XGBoost':  {
            'max_depth': [3,5], # Lower helps with overfitting
            'n_estimators':[200,300],
            'learning_rate': [0.01,0.1],
            'colsample_bytree' : [0.2], 
        },                      #small numbers reduces accuracy but runs faster 

        'RandomForest':  {
            'max_depth': [5,10],
            'max_features': [15,20],
            'min_samples_split': [5, 10],
            'min_samples_leaf': [3, 5],
            'bootstrap': [True],
            'n_estimators':[100]},
    }
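These grids plug into sklearn's GridSearchCV. A runnable sketch with the logistic regression grid on synthetic data (we add solver="liblinear", an assumption on our part, since the default solver does not support the l1 penalty):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=200, random_state=0)

grid = GridSearchCV(
    LogisticRegression(solver="liblinear"),  # liblinear supports both l1 and l2
    param_grid={"penalty": ("l1", "l2"), "C": (10, 1, 0.1, 0.01)},
    scoring="roc_auc",
    cv=5,
)
grid.fit(X, y)
best = grid.best_params_   # e.g. the best penalty/C combination found
```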

Best Parameters for All models

Logistic Regression

Random Forest

XGBoost Classifier

Experimental results

Traditional Models: below is the table of results on the given dataset.

Deep Learning

Single Layer Neural Network Multi Layer Neural Network

Feature Importance

Random Forest

XGBoost Classifier

Leakage Problem:

Proper measures were taken to reduce leakage problems. We used cross-validation folds while training the model and allocated a specific amount of the training data to a validation set. While splitting the data, we make sure to drop the target variable before fitting and predicting. We also perform OHE for categorical attributes and standard scaling for numerical attributes, together with imputation techniques (most-frequent for categorical attributes and mean for numerical attributes). Considering all these steps, we believe we handled the data leakage problem throughout the project.

Discussion of Results

Among the models discussed above, XGBoost stood out as the best predictive model, using the top 183 features with a 75.37% ROC score, followed by logistic regression; the worst performance came from the multi-layer neural network with a 59.34% AUC score.

* Logistic Regression: this model was chosen as the baseline, trained on both the balanced and imbalanced datasets with feature engineering. The training accuracy for this model is 70.05% and the test accuracy is 69.84%. A 75.18% ROC score resulted with the best parameters for this model.

* XGBoost: by far the best model, both in terms of timing and accuracy, for the selected features and balanced dataset. The training accuracy is 86.00% and the test accuracy is 78.65%. The test ROC AUC is 75.37%.

* Random Forest: among our decision tree models, Random Forest produced a training accuracy of 85.90% and a test accuracy of 78.77%. The test ROC score came out as 73.40%.

* Multi-layer Neural Network: by far the most underperforming model compared to the traditional models, with a 51.72% AUC score (the single-layer model achieves a 74.8% AUC score). The multi-layer network likely underperforms due to a lack of feature selection and difficulty identifying the right number of hidden layers.

TensorBoard Results

Single Layer Neural Network Model results:

From the TensorBoard results we can clearly see that the loss on the train data converges at around epoch 300 out of 500 epochs.

Multi layer Neural Network Model results below:

From the TensorBoard results we can clearly see a gradual decrease in the training loss during the initial epochs, after which the loss converges.

Problems faced

The problems encountered, apart from the accuracy of the model, include:

Conclusion

Our implementation using ML models to predict whether an applicant will be able to repay a loan was successful. Extending phase 1's simple baseline model, data modelling with feature aggregation, feature engineering, and various data-preprocessing pipelines both increased and reduced the efficiency of the models. The models used for prediction were logistic regression, ensemble approaches using gradient boosting (XGBoost) and Random Forest, and SVM. In the current phase we implemented a multi-layer neural network model using PyTorch.

Our best-performing model was XGBoost, with the best AUC score of 75.37%; the lowest-performing model was the multi-layer neural network at 51.72%. Our best Kaggle scores for the XGBoost submission are 0.72922 (private) and 0.72657 (public), and for the voting classifier 0.75709 (private) and 0.75885 (public). We believed the multi-layer neural network model would achieve a higher AUC score; however, it underperformed compared to the traditional models, scoring 0.510 (private) and 0.512 (public).

Kaggle Submission

Please provide a screenshot of your best kaggle submission for traditional & Multi layer neural network model.

References

Some of the material in this notebook has been adapted from the following:

https://www.kaggle.com/competitions/home-credit-default-risk/data

https://www.kaggle.com/gemartin/load-data-reduce-memory-usage

https://towardsdatascience.com/a-machine-learning-approach-to-credit-risk-assessment-ba8eda1cd11f

https://medium.com/@dipti.rohan.pawar/correlation-statistical-analysis-9471411f0431

https://scikit-learn.org/stable/modules/generated/sklearn.utils.resample.html

https://www.kaggle.com/willkoehrsen/start-here-a-gentle-introduction/notebook

https://juhiramzai.medium.com/introduction-to-credit-risk-modeling-e589d6914f57


https://www.analyticsvidhya.com/blog/2020/10/7-feature-engineering-techniques-machine-learning/

https://www.analyticsvidhya.com/blog/2020/03/google-colab-machine-learning-deep-learning/

https://stackify.com/python-garbage-collection/

https://www.analyticsvidhya.com/blog/2020/10/overcoming-class-imbalance-using-smote-techniques/

https://towardsdatascience.com/5-smote-techniques-for-oversampling-your-imbalance-data-b8155bdbe2b5

https://www.analyticsvidhya.com/blog/2020/10/feature-selection-techniques-in-machine-learning/

https://towardsdatascience.com/data-leakage-in-machine-learning-